[Fetch Migration] Enable migration for identical, empty target cluster index #390

kartg · 2023-11-06T17:58:01Z

Description

Category: Enhancement

Why these changes are required?

This change enables Fetch Migration to migrate data for indices that are identical between the source and target cluster, and the document count on the target cluster is zero. Such behavior allow users to execute Fetch Migration in an idempotent manner without having to manually remove empty indices on the target cluster (that were created by a previous run of Fetch Migration that did not complete successfully)

What is the old behavior before changes and new behavior after changes?

Prior to this change, all indices that were identical between the source and target clusters were ineligible for data migration. Now, we check the doc_count for identical indices on the target cluster and only exclude identical indices with a non-zero document count.

Testing

$ python -m coverage report --omit "*/tests/*"
Name                           Stmts   Miss  Cover
--------------------------------------------------
endpoint_info.py                   6      0   100%
fetch_orchestrator.py             25      0   100%
index_diff.py                     20      0   100%
index_doc_count.py                 5      0   100%
index_operations.py               49      0   100%
metadata_migration.py            118      0   100%
metadata_migration_params.py       7      0   100%
metadata_migration_result.py       5      0   100%
migration_monitor.py              94      4    96%
migration_monitor_params.py        6      0   100%
progress_metrics.py               89      0   100%
utils.py                          13      0   100%
--------------------------------------------------
TOTAL                            437      4    99%

Also tested end-to-end via the CDK. Log output from a test run:

INFO:root:Running pre-migration steps...

Identical indices in the target cluster: [logs-241998, sonested, logs-181998, logs-221998, reindexed-logs, nyc_taxis, geonames, logs-191998, logs-201998, logs-231998, logs-211998]

Identical empty indices in the target cluster (data will be migrated): [sonested, reindexed-logs, nyc_taxis, geonames]

Indices present in both clusters with conflicting settings/mappings (data will not be migrated): []

Indices to be created in the target cluster (data will be migrated): []

Total number of documents to be moved: 3000

...

Check List

New functionality includes testing
- All tests pass, including unit test, integration test and doctest
New functionality has been documented
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov · 2023-11-06T18:02:29Z

Codecov Report

Merging #390 (d327ab5) into main (4497873) will increase coverage by 0.09%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##               main     #390      +/-   ##
============================================
+ Coverage     63.55%   63.64%   +0.09%     
- Complexity      715      718       +3     
============================================
  Files            82       82              
  Lines          3298     3298              
  Branches        303      303              
============================================
+ Hits           2096     2099       +3     
+ Misses         1014     1011       -3     
  Partials        188      188

Flag	Coverage Δ
unittests	`63.64% <ø> (+0.09%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 2 files with indirect coverage changes

FetchMigration/python/metadata_migration.py

FetchMigration/python/index_diff.py

sumobrian · 2023-11-09T16:03:25Z

FetchMigration/python/index_operations.py

-    resp = requests.get(actual_endpoint, auth=endpoint.auth, verify=endpoint.verify_ssl)
+def doc_count(indices: set, endpoint: EndpointInfo) -> IndexDocCount:
+    actual_endpoint = endpoint.url + ','.join(indices) + __SEARCH_COUNT_PATH
+    resp = requests.get(actual_endpoint, auth=endpoint.auth, verify=endpoint.verify_ssl, json=__SEARCH_COUNT_PAYLOAD)


This should be wrapped in an exception block printing errors or warnings when there is a connection problem. Additionally, status codes should be checked.

I've got an open issue tracking such fixes across the implementation

deployment/cdk/opensearch-service-migration/dp_pipeline_template.yaml

sumobrian · 2023-11-09T19:15:16Z

deployment/cdk/opensearch-service-migration/dp_pipeline_template.yaml

  sink:
  - opensearch:
-      bulk_size: 10
+      # DO NOT CHANGE the hosts value - this will be updated by the CDK


Same as comment above: I would add a note that they would need to change these only if running locally (in Docker)

After the most recent updates to the Fetch Migration CDK, the hosts value in the sink will be replaced by an inline target host value through the CDK, so changing this value will not affect that logic nor reflect in the final setup. I've updated the comment here to:

# Note - CDK code will replace this value with the target cluster endpoint

This is in contrast to the hosts value in source where the string value matters because the CDK performs a string substitution to inject the source cluster endpoint.

This is encapsulated in an IndexDocCount class, which also track the total doc count Signed-off-by: Kartik Ganesh <[email protected]>

…cluster indices This change includes refactoring the index difference logic to an IndexDiff class, and an additional network call to the target cluster to fetch document counts for identical indices to determine which ones are empty. Signed-off-by: Kartik Ganesh <[email protected]>

This change adds documentation to the dp_pipeline_template.yaml file and changes some reasonable defaults. It also bumps the Fetch Migration task definition to 1 CPu and 4 GB memory. Signed-off-by: Kartik Ganesh <[email protected]>

Signed-off-by: Kartik Ganesh <[email protected]>

kartg requested review from chelma, gregschohn, lewijacn, mikaylathompson, okhasawn and sumobrian as code owners November 6, 2023 17:58

kartg force-pushed the fetch-doc-count-improvements branch 3 times, most recently from d968a25 to 29b6d8f Compare November 8, 2023 18:18

sumobrian requested changes Nov 9, 2023

View reviewed changes

kartg force-pushed the fetch-doc-count-improvements branch from c9fc513 to fd9edd2 Compare November 10, 2023 19:40

kartg added 6 commits November 10, 2023 11:47

doc_count now doc-counts by index

87bea49

This is encapsulated in an IndexDocCount class, which also track the total doc count Signed-off-by: Kartik Ganesh <[email protected]>

Bugfix to jsonpath library name in requirements.txt

e5522ce

Signed-off-by: Kartik Ganesh <[email protected]>

Minor - add unit test for endpoint_info URL normalization

7f57689

Signed-off-by: Kartik Ganesh <[email protected]>

Incorporating PR feedback

54bead7

Signed-off-by: Kartik Ganesh <[email protected]>

kartg force-pushed the fetch-doc-count-improvements branch from fd9edd2 to 54bead7 Compare November 10, 2023 20:26

[Bugfix] Use 'r+' instead of 'rw' in fetch_orchestrator.py

d327ab5

Signed-off-by: Kartik Ganesh <[email protected]>

sumobrian approved these changes Nov 11, 2023

View reviewed changes

kartg merged commit fb0553a into opensearch-project:main Nov 11, 2023
8 checks passed

kartg deleted the fetch-doc-count-improvements branch November 11, 2023 01:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Fetch Migration] Enable migration for identical, empty target cluster index #390

[Fetch Migration] Enable migration for identical, empty target cluster index #390

kartg commented Nov 6, 2023 •

edited

Loading

codecov bot commented Nov 6, 2023 •

edited

Loading

sumobrian Nov 9, 2023

kartg Nov 9, 2023

sumobrian Nov 9, 2023

kartg Nov 10, 2023

[Fetch Migration] Enable migration for identical, empty target cluster index #390

[Fetch Migration] Enable migration for identical, empty target cluster index #390

Conversation

kartg commented Nov 6, 2023 • edited Loading

Description

Testing

Check List

codecov bot commented Nov 6, 2023 • edited Loading

Codecov Report

sumobrian Nov 9, 2023

Choose a reason for hiding this comment

kartg Nov 9, 2023

Choose a reason for hiding this comment

sumobrian Nov 9, 2023

Choose a reason for hiding this comment

kartg Nov 10, 2023

Choose a reason for hiding this comment

kartg commented Nov 6, 2023 •

edited

Loading

codecov bot commented Nov 6, 2023 •

edited

Loading